Image insertion in video streams using a combination of physical sensors and pattern recognition

ABSTRACT

A live video insertion system (LVIS) is disclosed that allows insertion of static or dynamic images into a live video broadcast in a realistic fashion on a real time basis. Initially, natural landmarks in a scene that are suitable for subsequent detection and tracking are selected. The landmarks are typically distributed throughout the entire scene, such as a ballpark or football stadium. The field of view of the camera at any instant is normally significantly smaller than the full scene that may be panned. The LVIS uses a combination of pattern recognition techniques and camera sensor data (e.g., pan, tilt, zoom, etc.) to locate, verify and track target data. Camera sensors are well suited for the searching requirements of an LVIS, while pattern recognition and landmark tracking techniques are better suited for the image tracking requirements of LVIS.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is related to and claims the benefit of U.S. Provisional Application Ser. No. 60/038,143 filed on Nov. 27, 1996 entitled "IMAGE INSERTION IN VIDEO STREAMS USING A COMBINATION OF PHYSICAL SENSORS AND PATTERN RECOGNITION".

The present application is also related to the following co-pending commonly owned applications: Ser. No. 08/563,598 filed Nov. 28, 1995 entitled "SYSTEM AND METHOD FOR INSERTING STATIC AND DYNAMIC IMAGES INTO A LIVE VIDEO BROADCAST"; Ser. No. 08/580,892 filed Dec. 29, 1995 entitled "METHOD OF TRACKING SCENE MOTION FOR LIVE VIDEO INSERTION SYSTEMS"; Ser. No. 08/662,089 filed Jun. 12, 1996 entitled "SYSTEM AND METHOD OF REAL-TIME INSERTIONS INTO VIDEO USING ADAPTIVE OCCLUSION WITH A SYNTHETIC COMMON REFERENCE IMAGE"; and Ser. No. 60/031,883 filed Nov. 27, 1996 entitled "CAMERA TRACKING USING PERSISTANT, SELECTED, IMAGE TEXTURE TEMPLATES". The foregoing applications are all incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to a system and method for tracking image frames for inserting realistic indicia into video images.

2. Description of Related Art

Electronic devices for inserting electronic images into live video signals, such as described in U.S. Pat. No. 5,264,933 by Rosser, et al., have been developed and used for the purpose of inserting advertising and other indicia into broadcast events, primarily sports events. These devices are capable of seamlessly and realistically incorporating logos or other indicia into the original video in real time, even as the original scene is zoomed, panned, or otherwise altered in size or perspective. Other examples include U.S. Pat. No. 5,488,675 issued to Hanna and U.S. Pat. No. 5,491,517 issued to Kreitman, et al.

Making the inserted indicia look as if it is actually in the scene is an important but difficult aspect of implementing the technology. A troublesome aspect is that the eye of the average viewer is very sensitive to small changes in the relative position of objects from field to field. Experimentally, instances have been found where relative motion of an inserted logo by as little as one tenth of one pixel of an NTSC television image is perceptible to a viewer. Placing, and consistently maintaining to a high precision, an inserted indicia in a broadcast environment is crucial in making video insertion technology commercially viable. A broadcast environment includes image noise, the presence of sudden rapid camera motion, the sporadic occurrence of moving objects which may obscure a considerable fraction of the image, distortions in the image due to lens characteristics and changing light levels, induced either by natural conditions or by operator adjustment, and the vertical interlacing of television signals.

In the prior art, the automatic tracking of image motion has generally been performed by two different methods.

The first method utilizes pattern recognition of the frames and examines the image itself and either follows known landmarks in the video scene, using correlation or difference techniques, or calculates motion using well known techniques of optical flow. See, Horn, B. K. P. and Schunck, B. G., "Determining Optical Flow", Artificial Intelligence, pp 185-203 (1981). Landmarks may be transient or permanent and may be a natural part of the scene or introduced artificially. A change in shape and pose of the landmarks is measured and used to insert the required indicia.

The second method, described, for instance, in U.S. Pat. No. 4,084,184 issued to D. W. Crain, uses sensors placed on the camera to provide focal distance, bearing and elevation information. These sensors exist to provide similar landmark positional data within a given camera's field of view.

Pattern Recognition Systems

In the pattern recognition type of image insertion systems developed by Rosser et al., for instance, the system has two distinct modes. First is the search mode, wherein each new frame of live video is searched in order to detect and verify a particular target image. Second is the tracking mode, in which the system knows that in the previous frame of video the target image was present. The system further knows the location and orientation of that previous frame with respect to some pre-defined reference coordinate system. The target image locations are tracked and updated with respect to the pre-defined reference coordinate system.

The search mode encompasses pattern recognition techniques to identify certain images. Obtaining positional data via pattern recognition, as opposed to using camera sensors, provides significant system flexibility because it allows live video insertion systems to make an insertion at any point in the video broadcast chain. For instance, actual insertion can be performed at a central site which receives different video feeds from stadiums or arenas around the country or world. The various feeds can be received via satellite or cable or any other means known in the art. Once the insertion is added, the video feed can be sent back via satellite or cable to the broadcast location where it originated, or directly to viewers.

Such pattern recognition search and tracking systems, however, are difficult to implement for some events and are the element most prone to error during live video insertion system operation. The Assignee herein, Princeton Video Image, Inc., has devised and programmed robust searches for many venues and events such as baseball, football, soccer and tennis. However, the time and cost to implement similar search algorithms can be prohibitive for other types of events. Pattern recognition searching is difficult for events in which major changes to the look of the venue are made within hours, or even days, of the event. This is because a pre-defined common reference image of the venue is difficult to obtain since the look of the venue is not permanently set. In such cases a more robust approach to the search problem is to utilize sensors attached to one or more of the cameras to obtain target positional data.

Camera Sensor Systems

The drawbacks of relying solely upon camera sensor systems are detailed below. In field trials with televised baseball and football games, previous systems encountered the following specific, major problems.

1. Camera Motion

In a typical sport, such as football or baseball, close up shots are taken with long focal length cameras operating at a distance of up to several hundred yards from the action. Both of these sports have sudden action, namely the kicking or hitting of a ball, which results in the game changing abruptly from a tranquil scene to one of fast moving action. As the long focal length cameras react to this activity, the image they record displays several characteristics which render motion tracking more difficult. For example, the motion of the image may be as fast as ten pixels per field. This will fall outside the range of systems that examine pixel windows that are less than 10 by 10 pixels. Additionally, the images may become defocused and suffer severe motion blurring, such that a line which in a static image is a few pixels wide blurs out to be 10 pixels wide. This means that a system tracking a narrow line suddenly finds no match, or makes assumptions such as that the zoom has changed when in reality only fast panning has occurred. This motion blurring also causes changes in illumination level and color, as well as pattern texture, all of which can be problems for systems using pattern based image processing techniques. Camera motion, even in as little as two fields, results in abrupt image changes in the local and large scale geometry of an image. An image's illumination level and color are affected by camera motion as well.

2. Moving Objects

Sports scenes generally have a number of participants whose general motion follows some degree of predictability, but who may at any time suddenly do something unexpected. This means that any automatic motion tracking of a real sports event has to be able to cope with sudden and unexpected occlusion of various parts of the image. In addition, the variety of uniforms and poses adopted by players in the course of a game means that attempts to follow any purely geometric pattern in the scene have to be able to cope with a large number of occurrences of similar patterns.

3. Lens Distortion

All practical camera lenses exhibit some degree of geometric lens distortion which changes the relative position of objects in an image as those objects move towards the edge of an image. When 1/10th of a pixel accuracy is required, this can cause problems.

4. Noise in the Signal

Real television signals exhibit noise, especially when the cameras are electronically boosted to cover low light level events, such as night time baseball. This noise wreaks havoc with image analysis techniques which rely on standard normalized correlation recognition, as these match pattern shapes irrespective of the strength of the signal. Because noise shapes are random, in the course of several hundred thousand fields of video (or a typical three hour game), the chances of mistaking noise patterns for real patterns can be a major problem.

5. Field-to-Field Interlace

Television images, in both NTSC and PAL standards, are transmitted in two vertically interlaced fields which together make up a frame. This means that television is not a single stream of images, but two streams of closely related yet subtly different images. The problem is particularly noticeable in looking at narrow horizontal lines, which may be very evident in one field but not the other.

6. Illumination and Color Change

Outdoor games are especially prone to illumination and color changes. Typically, a summer night baseball game will start in bright sunlight and end in floodlight darkness. An illumination change of a factor of more than two is typical in such circumstances. In addition, the change from natural to artificial lighting changes the color of the objects in view. For instance, at Pro Player Park in Florida, the walls appear blue under natural lighting but green under artificial lighting.

7. Setup Differences

Cameras tend to be set up with small but detectable differences from night to night. For instance, camera tilt typically varies by up to plus or minus 1%, which is not immediately obvious to the viewer. However, this represents plus or minus 7 pixels and can be a problem for typical templates measuring 8 pixels by 8 pixels.

The advantages of camera sensors include the ability to be reasonably sure of which camera is being used, where it is pointing, and at what magnification the camera is viewing the image. Although there may be inaccuracies in the camera sensor data due to inherent mechanical uncertainties, such as gear back-lash, these inaccuracies will never be large. A camera sensor system will, for instance, not mis-recognize an umpire as a goal post, or "think" that a zoomed out view of a stadium is a close up view of the back wall. It will also never confuse motion of objects in the foreground as being movement of the camera itself.

What is needed is a system that combines the advantages of both pattern recognition systems and camera sensor systems for searching and tracking scene motion while eliminating or minimizing the disadvantages of each. The primary difficulty in implementing a pattern recognition/camera sensor hybrid insertion system is the combining and/or switching between data obtained by the two completely different methods. If not done correctly, the combination or switch over gives unstable results which show up as the inserted image jerking or vibrating within the overall image. Overcoming this difficulty is crucial to making a hybrid system work well enough for broadcast quality.

SUMMARY OF THE INVENTION

By way of background, an LVIS, or live video insertion system, is described in commonly owned application Ser. No. 08/563,598 filed Nov. 28, 1995 entitled "SYSTEM AND METHOD FOR INSERTING STATIC AND DYNAMIC IMAGES INTO A LIVE VIDEO BROADCAST". An LVIS is a system and method for inserting static or dynamic images into a live video broadcast in a realistic fashion on a real time basis. Initially, natural landmarks in a scene that are suitable for subsequent detection and tracking are selected. Landmarks preferably comprise sharp, bold, and clear vertical, horizontal, diagonal or corner features within the scene visible to the video camera as it pans and zooms. Typically, at least three or more natural landmarks are selected. It is understood that the landmarks are distributed throughout the entire scene, such as a baseball park or a football stadium, and that the field of view of the camera at any instant is normally significantly smaller than the full scene that may be panned. The landmarks are often located outside of the destination point or area where the insert will be placed because the insert area is typically too small to include numerous identifiable landmarks and the insertable image may be a dynamic one and, therefore, it has no single, stationary target destination.

The system models the recognizable natural landmarks on a deformable two-dimensional grid. An arbitrary, non-landmark, reference point is chosen within the scene. The reference point is mathematically associated with the natural landmarks and is subsequently used to locate the insertion area.

Prior to the insertion process, artwork of the image to be inserted is adjusted for perspective, i.e., shape. Because the system knows the mathematical relationship between the landmarks in the scene, it can automatically determine the zoom factor and X, Y position adjustment that must be applied. Thereafter, when the camera zooms in and out and changes its field of view as it pans, the insertable image remains properly scaled and proportioned with respect to the other features in the field of view so that it looks natural to the home viewer. The system can pan into and out of a scene and have the insertable image naturally appear in the scene rather than "pop up" as has been the case with some prior art systems. The system can easily place an insertable image at any location.

The present invention is a hybrid live video insertion system (LVIS) using a combination of the pattern recognition techniques just described, as well as others, and camera sensor data to locate, verify and track target data. Camera sensors are well suited to the search and detection, i.e. recognition, requirements of an LVIS while pattern recognition and landmark tracking techniques, including co-pending provisional application Ser. No. 60/031,883 filed Nov. 27, 1996 entitled "CAMERA TRACKING USING PERSISTANT, SELECTED, IMAGE TEXTURE TEMPLATES", are better suited for the image tracking requirements of an LVIS.

The concept behind the present invention is to combine camera sensor data and optical pattern technology so that the analysis of the video image stabilizes and refines the camera sensor data. This stabilization and refinement can be done by substituting the camera sensor data for the prediction schemes used by standard LVIS systems for searching for and tracking landmark data, or by using the camera sensor data as yet another set of landmarks, with appropriate weighting function, in the model calculation performed by standard LVIS systems. Once the camera sensors have acquired the requisite data corresponding to landmarks in the scene, the data is converted to a format that is compatible with and usable by the tracking functions of the standard LVIS, and the rest of the insertion process is carried out normally.

Thus, the present invention takes advantage of camera sensor data to provide an LVIS with robust search capability independent of the details of the event location. Moreover, many of the disadvantages pertaining to camera sensor systems as described above are overcome.

The present invention comprises a typical LVIS in which one or more event cameras include sensors for sensing the zoom and focus of the lens, and the pan and tilt of the camera with respect to a fixed platform. For cameras in unstable locations, additional sensors are included which measure the motion of the substantially fixed platform with respect to a more stable stadium reference. For hand-held or mobile cameras, a still further set of sensors are included for measuring camera location and orientation with respect to a pre-determined set of reference positions. Sensor data from each camera, along with tally data from the production switcher, if necessary, is used by the LVIS to search for and detect landmark data and thereby provide a coarse indication of where an insertion should occur in the current image. Tally data takes the form of an electronic signal indicating which camera or video source is being output as the program feed by the video switcher.

The sensors and tally data essentially replace the search mode of conventional pattern recognition live video insertion systems. An accurate final determination of an insertion location is made by using feature and/or texture analysis in the actual video image. This analysis compares the position of the features and/or texture within the video frame to their corresponding location in a common reference image or previous image of the insertion location and surroundings as described in co-pending applications Ser. No. 08/580,892 filed Dec. 29, 1995 entitled "METHOD OF TRACKING SCENE MOTION FOR LIVE VIDEO INSERTION SYSTEMS" and 60/031,883 filed Nov. 27, 1996 entitled "CAMERA TRACKING USING PERSISTANT, SELECTED, IMAGE TEXTURE TEMPLATES".

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic representation showing a reference video image of a scene.

FIG. 2 is a schematic representation showing a live video image of the reference video image in FIG. 1.

FIG. 3 is a table illustrating the elements of a typical representation of a reference array.

FIG. 4 illustrates a schematic representation of field number versus y image position in an interlaced video field.

FIG. 5a illustrates a cross-sectional view of a zero mean edge template.

FIG. 5b illustrates a plan view of a zero mean edge template.

FIG. 6 illustrates a correlation surface.

FIG. 7 illustrates a measured and predicted position on a surface.

FIG. 8 illustrates a schematic flow diagram of how a track, reference, and code hierarchy of reference arrays is used to manage an adaptive reference array.

FIG. 9 illustrates a schematic view of landmarks and their associated sensor points used for color based occlusion.

FIG. 10 is a schematic representation of an event broadcast using a combination of camera sensors and an image tracking system.

FIG. 11 is a block diagram describing the system of the present invention in which the camera data is used to predict landmark location.

FIG. 12 is a block diagram describing the system of the present invention in which the camera data is used to provide extra "virtual" landmarks appropriately weighted to compensate for camera data errors.

FIG. 13 illustrates a camera fitted with pan, tilt, zoom and focus sensors.

FIG. 14 illustrates a representation of data output from an optically encoded sensor.

FIG. 15 illustrates the relationship between the transition of sensor track A, the state of sensor track B and the direction of rotation, clockwise (CW) or counter-clockwise (CCW), of the sensor.

FIG. 16 illustrates a common reference image taken from a broadcast image.

FIG. 17 illustrates a plot of Zoom (Image Magnification) against Z (the number of counts from the counter attached to the zoom lens' zoom-element driver) with the focus-element of the lens held stationary. Three other plots are overlaid on top of this Zoom against Z plot. The three overlays are plots of Zoom (Image Magnification) against F (the number of counts from the counter attached to the zoom lens' focus-element driver) at three distinct, different and fixed settings of Z (the counts from the zoom-element driver).

FIG. 18 illustrates a camera fitted with accelerometers (sensors) for detecting camera motion.

FIG. 19 illustrates three fixed receiving stations used to track the motion of a mobile camera fitted with a transmitter.

FIG. 20 illustrates a broadcast situation in which the camera and an object of interest to the event, such as a tennis ball, are both fitted with transmitters.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

During the course of this description like numbers will be used to identify like elements according to the different figures that illustrate the invention.

The standard LVIS search/detection and tracking method, as described in Ser. No. 08/580,892 filed Dec. 29, 1995 entitled "METHOD OF TRACKING SCENE MOTION FOR LIVE VIDEO INSERTION SYSTEMS", uses template correlation with zoom insensitive templates, such as edges, to follow a group of pre-designated landmarks or some subset of a group within a scene. Template correlation of landmarks provides raw position information used to follow the motion of a scene. Typically, the landmarks used may be parts of the structure in a ball park or markings on a field of play. Creating an ideal mathematical formulation of the scene to be tracked is a key part of the tracking algorithm. This ideal mathematical representation is referred to as the reference array and is simply a table of x,y coordinate values. The term "image" associated with the array is for operator convenience. Current images or scenes are related to this reference array by a set of warp parameters which define the mathematical transform that maps points in the current scene to corresponding points in the reference array. In the simple case in which rotation is ignored or kept constant the current image is mapped to the reference array as follows:

    x'=a+bx

    y'=d+by

where x' and y' are the coordinates of a landmark in the current scene, x and y are the coordinates of the same landmark in the reference array, b is the magnification between the reference array and the current scene, a is the translation in the x direction and d is the translation in the y direction between the reference array and the current scene.

The essence of adaptive, geographic hierarchical tracking is paying most attention to landmarks which are found at or close to their anticipated model derived positions.

The first step is to obtain an accurate velocity prediction scheme to locate the anticipated model derived position. Such a scheme estimates, via the warp parameters from the previous field or scene, where the landmarks in the current image should be. The primary difficulty with velocity prediction in interlaced video is that from field to field there appears to be a one pixel y component to the motion. The present invention handles this by using the position from the previous like field, and motion from the difference between the last two unlike fields.

Having predicted where in the current image the landmarks should be, template correlations over a 15 by 15 pixel region are then performed centered on this predicted position. These correlation patterns are then searched from the center outward looking for the first match that exceeds a threshold criteria. Moreover, each landmark has a weighting function whose value is inversely proportional to the distance the landmark is away from its anticipated model derived position. When calculating the new warp parameters for the current scene, each landmark's current position is used weighted by this function. This gives more emphasis to landmarks which are closer to their predicted positions.

A further step, necessary to compensate for camera distortion as the scene moves, is to dynamically update the reference array coordinates of the landmarks based on their current locations. This updating is done only on good landmarks, and is itself heavily weighted by the distance error weighting function. This adaptive reference array allows very accurate tracking of landmarks even as they pass through lens and perspective distortions. The danger in having an adaptive reference array is that it may get contaminated. This danger is mitigated by having three sets of reference coordinates, which are referred to as the code, game and tracking reference coordinates. When the system is initially loaded, the code reference coordinates are set to the original reference coordinates. The game and tracking coordinates are initially set equal to the code reference coordinates. Once the system locates a scene and begins tracking, the tracking coordinates are used. However, each time a scene cut occurs, the tracking coordinates are automatically reset to the game reference coordinates. At any time the operator may choose to set the current tracking coordinates equal to the game reference coordinates or to set the game reference coordinates back to the code reference coordinates. This scheme allows for adaptive reference updating with operator override capability.

The final element in the tracking scheme is a method of determining when a landmark is obscured by some object, so as to avoid spurious data in the system. A color based occlusion method is used in which a set of sensor points in a pattern around where a landmark is found are examined, and if they are found to differ from the colors expected in those regions, the landmark is deemed to be occluded and not used in further calculations. The sensor points from good landmarks are used to update the reference values for expected colors of the sensor points so that the system can accommodate changing conditions such as the gradual shift from sunlight to artificial light during the course of a broadcast.

This strategy of adaptive, hierarchical tracking has proved to be a means of high precision and robust tracking of landmarks within video sequences even in the noisy, real world environment of live broadcast television.

Referring to FIG. 1, motion tracking of video images which allows seamless insertion as practiced by this invention starts with a reference array 10 of a scene in which insertions are to be placed. Although having an actual image is a useful mental aid, this reference array is nothing more than a set of idealized x,y coordinate values which represent the position of a number of key landmark sets 16 and 18 within reference array 10. A typical table is shown in FIG. 3, illustrating the listing of the x, or horizontal, coordinates 31 and the y, or vertical, coordinate positions 33. The positions 31 and 33 of key landmark sets 16 and 18 are used both as references against which motion can be measured and in relation to which insertions can be positioned. A typical reference array 10 of a baseball scene from a center field camera will consist of the locations of features such as the pitcher's mound 12, the back wall 14, vertical lines 15 between the pads which make up the back wall 14, and the horizontal line 17 between the back wall and the field of play on which the horizontal set of landmarks 18 are set.

The current image or scene 20 is the field from a video sequence which is presently being considered. Locations of key features or landmark sets 16 and 18 from reference array 10 also are indicated in current image 20 as measured positions 26 and 28. Measured positions 26 and 28 are related to corresponding reference array landmark locations from sets 16 and 18 by a set of warp parameters which define a mathematical transform that most accurately maps the position of points in current image 20 to the position of points in reference array 10. Such mappings are well known mathematically. See, "Geometrical Image Modification in Digital Image Processing", W. K. Pratt, 2nd Edition, 1991, John Wiley and Sons, ISBN 0-471-85766.

Tracking the view from a fixed television camera, especially one with a reasonably long focal length as in most sporting events, can be thought of as mapping one two-dimensional surface to another two-dimensional surface. A general mathematical transform that accomplishes such a mapping, allowing for image to image translation, zoom, shear, and rotation, is given by the following six parameter model:

    x'=a+bx+cy

    y'=d+ex+fy

where

x and y are coordinates in reference array 10,

x' and y' are the transformed coordinates in current image 20,

a is the image translation in the x direction,

b is the image magnification in the x direction,

c is a combination of the rotation, and skew in the x direction,

d is the image translation in the y direction,

e is a combination of the rotation, and skew in the y direction, and

f is the image magnification in the y direction.

The tracking algorithms and methods discussed herein can be used with the above transformation as well as other more general transformations. However, experience has shown that with a dynamically updated reference array, a simpler x,y mapping function which assumes no shear or rotation will suffice. Thus, in the simple case in which rotation is ignored or kept constant (c=e=0) and the magnification in the x and y directions is the same (b=f), the position of points in current image 20 are mapped to the position of points in reference array 10 using the following equations:

    x'=a+bx

    y'=d+by

where x' and y' are coordinates of a landmark in current image 20, x and y are coordinates of the same landmark in reference array 10, b is the magnification between reference array 10 and current image 20, a is the translation in the x direction, and d is the translation in the y direction. This simplified mapping scheme is used because experience has shown it to be both robust and capable of handling the limited shear, rotation, and perspective distortion present in television sports broadcasts when a dynamically updated reference array is used.
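
The simplified mapping and its inverse can be illustrated with a short sketch. The following Python fragment is purely illustrative and not taken from the patent; the function names are assumptions. It applies the two equations above in the forward direction and, anticipating the back projection used later for adaptive reference updating, in the inverse direction as well.

    # A minimal sketch (not the patent's implementation) of the simplified warp
    # model x' = a + b*x, y' = d + b*y described above, mapping reference array
    # coordinates into the current image and back again.

    def warp_to_image(x, y, a, d, b):
        """Map a reference array point (x, y) into the current image."""
        return a + b * x, d + b * y

    def warp_to_reference(x_img, y_img, a, d, b):
        """Back project an image point into the reference array (inverse warp)."""
        return (x_img - a) / b, (y_img - d) / b

    # Example: pure translation of (10, -4) pixels at 1.5x magnification.
    print(warp_to_image(100.0, 50.0, a=10.0, d=-4.0, b=1.5))      # (160.0, 71.0)
    print(warp_to_reference(160.0, 71.0, a=10.0, d=-4.0, b=1.5))  # (100.0, 50.0)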

Motion tracking is the method of measuring positions of landmark sets 26 and 28 in current image 20 and using these measurements to calculate the warp parameters a, d and b, as defined by the equations above. An important part of adaptive geographic hierarchical tracking is the concept of assigning a weight to each landmark. Weights are assigned, in inverse proportion, according to the distance each landmark is detected away from where it is expected or predicted to be found. The closer a landmark is found to where it is predicted to be, the greater the weight given to that landmark in the calculation of the warp parameters linking the positions in current image 20 to the positions in reference array 10.

The first step is predicting where the landmarks 26 and 28 should be in current image 20. This is done by analyzing the landmark positions in the three previous fields. The previous position and velocity of a landmark derived from the previous model is used to estimate where the landmark will appear in the current image 20. The position and velocity calculations are complex in that both the current standard methods of television transmission, NTSC and PAL, are sent in two vertically interlaced fields. Thus, alternate horizontal scans are included in separate fields, customarily referred to as odd and even fields. In the NTSC system, each field is sent in 1/60th of a second (16.6 msecs), making a combined single frame every 1/30th of a second.

One important practical consideration in the velocity estimations is that the x and the y positions in the previous fields (-1, -2 and -3) that are used in the velocity estimations are not the measured positions, but the positions calculated using the final warp parameters derived in each of those fields. That is, in each field, x and y positions are measured for each landmark. All of the landmarks are then used to derive a single set of warp parameters a, b and d giving the mapping between the current image and the reference array. That single set of warp parameters is then used to project the reference array coordinates 10 into the current image 20, giving an idealized set of landmark positions in the current image. It is this idealized set of landmark positions in each field, referred to as the model derived positions, that are used in the velocity predictions.

As illustrated in FIG. 4, the current y or vertical position of a landmark is predicted from the previous three fields. The y position in the current field (field 0) is predicted by measuring the y component of velocity as the difference between the landmark's model derived position in field -1 and field -3, which are "like" fields in that they are both either odd or even. The y velocity component is then added to the model derived y position in field -2, which is the previous field "like" the current field, to arrive at an estimate of where to find that landmark in the current field.

The prediction in the x direction could use the same algorithm or, since there is no interlace, the x direction calculation can be simpler and slightly more current. In the simpler scheme, the x component of the velocity is calculated as the difference between the landmark's model derived position in field -1 and its model derived position in field -2. This difference is then added to the model derived position in field -1 to arrive at an estimate of where to find that landmark in the current field.
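
As a hedged illustration of the prediction step just described, the sketch below assumes the model derived positions of one landmark in fields -1, -2 and -3 are available and computes the predicted position in the current field: the y estimate uses the "like" field pair (-1, -3) for velocity added to the field -2 position, while the x estimate uses the (-1) minus (-2) difference added to the field -1 position. The function and variable names are illustrative only.

    # Illustrative sketch of the interlace-aware velocity prediction described
    # above; the inputs are the model derived (x, y) positions of a landmark in
    # the three previous fields. Names are assumptions, not taken from the patent.

    def predict_landmark_position(pos_m1, pos_m2, pos_m3):
        """Predict (x, y) in the current field from fields -1, -2 and -3."""
        x1, y1 = pos_m1   # field -1 (unlike the current field)
        x2, y2 = pos_m2   # field -2 (like the current field)
        x3, y3 = pos_m3   # field -3 (like field -1)

        # y: velocity from the two "like" fields -1 and -3, added to field -2,
        # the previous field of the same parity as the current field.
        y_velocity = y1 - y3
        y_pred = y2 + y_velocity

        # x: no interlace, so the simpler and more current (-1) - (-2) difference
        # is added to the field -1 position.
        x_velocity = x1 - x2
        x_pred = x1 + x_velocity

        return x_pred, y_pred

    # Example: a landmark drifting right and down over the last three fields.
    print(predict_landmark_position((105.0, 52.0), (103.0, 51.0), (101.0, 50.0)))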

Having predicted the most likely position of all the landmarks in the current image, the positions of the landmarks are then found by doing a correlation of an 8 by 8 pixel template over a 15 by 15 pixel region centered at the predicted position. Correlation or template matching is a well known technique, and in its standard form is one of the most fundamental means of object detection. See, Chapter 20, "Image Detection and Recognition of Digital Image Processing" by W. K. Pratt (2nd Edition, 1991, John Wiley and Sons, ISBN 0-471-85766). Unlike more standard methods of correlation or template matching in which the template is made to closely resemble the part of the scene it is being used to find, the templates in the present invention are synthetic, idealized both in shape and value, and are "zero-mean".

For instance, in tracking a football goal post upright, rather than use a portion of the goal post taken from the image, the template 54 used is an edge of uniform value made from a negative directed line 56 and a positive directed line 58, and the sum of the values in the 8 by 8 template is equal to zero as shown schematically in cross-section in FIG. 5a and in plan view in FIG. 5b.

This template has the advantages of being zoom independent and will give a zero value on a surface of uniform brightness. The technique is not limited to 8 by 8 pixel templates, nor is the region over which they are correlated limited to 15 by 15 pixel regions. Further, this technique is not limited to zero mean templates either. In circumstances where only vertical and horizontal lines and edges are being tracked it is possible to reduce computation by having (1×n) correlation surfaces for following the horizontal detail, and (n×1) correlation surfaces for following the vertical detail, where n is any reasonable number, usually in the range of 5-50 pixels.
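
By way of illustration only, the following sketch builds one possible 8 by 8 zero-mean vertical edge template of the kind described above: a negative directed column next to a positive directed column, with all entries summing to zero. The amplitude value is an arbitrary assumption, not a value taken from the patent.

    # Illustrative construction of an 8x8 zero-mean vertical edge template:
    # one negative-going column next to one positive-going column, summing to
    # zero overall. The amplitude (1.0) is an arbitrary assumption.

    def make_zero_mean_edge_template(size=8, amplitude=1.0):
        template = [[0.0] * size for _ in range(size)]
        left = size // 2 - 1    # negative directed line
        right = size // 2       # positive directed line
        for row in range(size):
            template[row][left] = -amplitude
            template[row][right] = +amplitude
        return template

    template = make_zero_mean_edge_template()
    assert abs(sum(sum(row) for row in template)) < 1e-9   # zero mean
    for row in template:
        print(row)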

The idealized, zero-mean edge template 54 is correlated over a 15 by 15 pixel region of the current image, or some amplified, filtered and decimated replica of it, to produce a correlation surface 60 as shown schematically in FIG. 6. This correlation surface 60 consists of a 15 by 15 array of pixels whose brightness corresponds to the correlation of the image against the template when centered at that position. Typically, an edge template 54 correlated over a region of an image containing a line will give both a positive going line response 66, indicating a good match, and a corresponding negative going line 67 indicating a mismatch. This mismatch line 67 can be useful in that its position and distance away from the positive going match line 66 give a measure of the width of the line and whether it is brighter or darker than its surroundings. In addition, there will be other bright pixels 68 on the correlation surface 60 corresponding to bright edge like features in the current image.

A guiding principle of the adaptive-geographic-hierarchical tracking method is to focus on landmarks and the correlation peaks indicating potential landmarks that are closest to where they are expected to be. Rather than just looking for a peak anywhere on the 15 by 15 correlation surface 60, these patterns are searched from the center outward. The simplest, and very effective, way of doing this is to first look at the central nine pixel values in the central 3 by 3 pixel region 64. If any of these pixels has a correlation value greater than a threshold then it is assumed that the pixel represents the landmark being sought and no further investigation of the correlation surface is done. The threshold is usually fifty percent of the usual landmark correlation anticipated. This 3 by 3 initial search allows motion tracking even in the presence of nearby objects that by their brightness or shape might confuse the landmark correlation, such as when the pixel marked 68 had been brighter than the pixels in the line 66. Once the pixel with the peak brightness is found, an estimate of the sub pixel position is found using the well known method of reconstructing a triangle as discussed in co-pending U.S. Pat. application Ser. No. 08/381,088. There are other sub pixel position estimating methods that may be used, such as fitting higher order curves to the data.
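
A minimal sketch of the center-outward search follows. It assumes the 15 by 15 correlation surface is stored as a list of lists, examines the central 3 by 3 region first against a threshold, and only widens the search to successively larger rings if no acceptable peak is found; the sub pixel refinement step is omitted. The function name, the best-in-ring tie-breaking and the fallback behavior are illustrative assumptions, not the patent's exact procedure.

    # Illustrative center-outward search over a 15x15 correlation surface.
    # The central 3x3 region is examined first; if any value there exceeds the
    # threshold, that position is accepted and the rest of the surface ignored.
    # Otherwise successively larger square rings are scanned. Sub pixel
    # refinement (triangle reconstruction) is omitted from this sketch.

    def find_peak_center_outward(surface, threshold):
        size = len(surface)              # 15 in the example described above
        center = size // 2
        for radius in range(1, center + 1):
            best = None
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if max(abs(dx), abs(dy)) != radius and radius > 1:
                        continue          # only scan the new ring after the 3x3
                    value = surface[center + dy][center + dx]
                    if value > threshold and (best is None or value > best[0]):
                        best = (value, center + dx, center + dy)
            if best is not None:
                return best               # (correlation value, x, y)
        return None                       # no acceptable landmark match

    # Example: a surface with a single strong response one pixel off center.
    surface = [[0.0] * 15 for _ in range(15)]
    surface[7][8] = 0.9
    print(find_peak_center_outward(surface, threshold=0.5))   # (0.9, 8, 7)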

In addition, each landmark found in a scene has an error weight associated with it based on its distance from where it is expected to be. Referring now to FIG. 7, the calculation of this error weight is based on the predicted position in the image 70, at the coordinates xp, yp, and the measured position in the image 72, at the coordinates xm, ym, using the general equation: ##EQU1## where g, h, i, j, k, and l are numerical constants chosen to vary the strength of the weighting function.

In the preferred embodiment the parameters of the equation are: ##EQU2## although in special circumstances, each of the parameters might have a different value to change the emphasis of the weighting. For instance, numerical constants i and j may be varied to provide a function which stays constant for a short distance and then drops rapidly.

This error weight is then used in the calculation of the warp parameters which map points in the current image 20 to the positions in the reference array 10. In the preferred embodiment this calculation is a weighted least mean squares fit using the following matrix: ##EQU3## where ##EQU4##

In the case of purely horizontal landmarks, nx=0 and ny=1, and in the case of purely vertical landmarks nx=1 and ny=0. In the more general case, nx and ny are the direction cosines of vectors representing the normal to the landmark's predominant direction.
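
The exact matrix formulation of the preferred embodiment is not reproduced above, but the idea of a weighted least mean squares fit of the warp parameters a, d and b can be sketched as follows. This is an assumption-labeled illustration, not the patent's matrix: it treats each landmark as contributing the two equations x' = a + b·x and y' = d + b·y, weighted by its error weight, and solves the resulting normal equations directly.

    # Hedged sketch of a weighted least mean squares fit of the warp parameters
    # a, d and b for the simplified model x' = a + b*x, y' = d + b*y. The normal
    # equations below follow from that model and the per-landmark error weights;
    # they are illustrative, not the matrix of the preferred embodiment.

    def fit_warp_weighted(ref_pts, img_pts, weights):
        """ref_pts, img_pts: lists of (x, y); weights: per-landmark error weights."""
        sw = swx = swy = swxx = 0.0
        rhs_a = rhs_d = rhs_b = 0.0
        for (x, y), (xi, yi), w in zip(ref_pts, img_pts, weights):
            sw += w
            swx += w * x
            swy += w * y
            swxx += w * (x * x + y * y)
            rhs_a += w * xi
            rhs_d += w * yi
            rhs_b += w * (x * xi + y * yi)
        # Solve the 3x3 normal equations:
        #   [ sw   0    swx  ] [a]   [rhs_a]
        #   [ 0    sw   swy  ] [d] = [rhs_d]
        #   [ swx  swy  swxx ] [b]   [rhs_b]
        # by eliminating a and d first.
        denom = swxx - (swx * swx + swy * swy) / sw
        b = (rhs_b - (swx * rhs_a + swy * rhs_d) / sw) / denom
        a = (rhs_a - b * swx) / sw
        d = (rhs_d - b * swy) / sw
        return a, d, b

    # Example: landmarks translated by (10, -4) at 1.5x magnification.
    ref = [(0.0, 0.0), (100.0, 0.0), (0.0, 50.0), (100.0, 50.0)]
    img = [(10.0 + 1.5 * x, -4.0 + 1.5 * y) for x, y in ref]
    print(fit_warp_weighted(ref, img, weights=[1.0, 0.5, 0.8, 1.0]))  # ~ (10, -4, 1.5)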

The adaptive part of the motion tracking scheme is necessary to allow for camera distortion. It also allows the system to compensate for small discrepancies between the stored idealized reference array and the actual scene, as well as allowing the system to handle small slow rotation and/or shear. It further allows the system to handle any small and slowly occurring distortions. This adaptation is done by dynamically updating the reference array coordinates based on their current locations. In the present invention the adaptive part of the motion tracking is made stable by the following criteria: 1) being very careful when it is allowed to occur; 2) choosing which landmarks are allowed to participate based on how confident the system is that said landmarks are good; and 3) having the whole calculation heavily weighted by the distance error weighting function. In addition, the reference array is reset after any scene cuts.

In the preferred embodiment the dynamic updating of the reference coordinates is started after six fields of tracking and is only done on landmarks which have not been flagged by any occlusion checks and have correlation values greater than 20% and less than 200% of expected reference values, though different values may be used for all these parameters.

The measured landmark positions are back projected to the positions in the reference array using the warp parameters calculated by all the good landmarks in the current field using the equations:

    Xnr=(Xm-a)/b

    Ynr=(Ym-d)/b

    Xr=X0r+(ErrorWeight)²(Xnr-X0r)

    Yr=Y0r+(ErrorWeight)²(Ynr-Y0r)

where:

Xm is the measured x coordinate of the landmark,

Ym is the measured y coordinate of the landmark,

a is the horizontal translation warp parameter,

d is the vertical translation warp parameter,

b is the magnification warp parameter,

Xnr is the calculated x coordinate of a proposed new reference point based on this field's data,

Ynr is the calculated y coordinate of a proposed new reference point based on this field's data,

X0r is the x coordinate of the old reference point prior to update,

Y0r is the y coordinate of the old reference point prior to update,

Xr is the x coordinate put into the table as the new reference point, and

Yr is the y coordinate put into the table as the new reference point.
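
The back projection and blending step defined by the four equations above can be written directly as a short sketch. The function below is illustrative (its name and argument order are assumptions); it takes the measured image position of one good landmark, the current warp parameters, the old reference coordinates and the landmark's error weight, and returns the updated reference coordinates.

    # Direct sketch of the reference update equations given above:
    #   Xnr = (Xm - a) / b,  Ynr = (Ym - d) / b
    #   Xr  = X0r + ErrorWeight**2 * (Xnr - X0r)
    #   Yr  = Y0r + ErrorWeight**2 * (Ynr - Y0r)
    # Function and variable names are illustrative only.

    def update_reference_point(xm, ym, a, d, b, x0r, y0r, error_weight):
        xnr = (xm - a) / b            # back project the measured x coordinate
        ynr = (ym - d) / b            # back project the measured y coordinate
        w2 = error_weight ** 2        # heavily weighted by the distance error weight
        xr = x0r + w2 * (xnr - x0r)
        yr = y0r + w2 * (ynr - y0r)
        return xr, yr

    # Example: a landmark measured at (161.0, 70.0) with warp (a=10, d=-4, b=1.5)
    # nudges its old reference coordinates (100.0, 50.0) toward the new estimate.
    print(update_reference_point(161.0, 70.0, 10.0, -4.0, 1.5, 100.0, 50.0, 0.5))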

It is also possible to use separate tracking reference arrays for odd and even fields to improve the tracking performance with interlaced video. Because of the potentially unstable nature of the adaptive reference array, the preferred embodiment has three related reference arrays, referred to as the: CODE REFERENCE, GAME REFERENCE, and TRACKING REFERENCE.

The schematic flow diagram in FIG. 8 illustrates how these three references are used. At start up, when the initial system is loaded, all three references are set to be the same, i.e. CODE REFERENCE=GAME REFERENCE=TRACKING REFERENCE, which is to say that the x and the y coordinates of the landmarks in each of the reference arrays are set to be the same as the coordinates of the landmarks in the code reference array.

At run time, when the image processing is done, the three reference arrays are used in the following manner. The game reference is used in search and verify mode, and in tracking mode the tracking reference is used.

Initially the tracking reference array is set equal to the game reference array. In the preferred embodiment this occurs on the first field in which the tracking is done. In subsequent fields the tracking reference is modified as detailed above. If separate tracking reference arrays are being used for odd and even fields they would both initially be set to the game reference array.

At any time during the tracking mode, the operator may elect to copy the current tracking reference into the game reference using standard computer interface tools such as a screen, keyboard, mouse, graphic user interface, trackball, touch screen or a combination of such devices. This function is useful at the start of a game. For instance, an operator may be setting up the live video insertion system to perform insertions at a particular stadium. The code reference coordinates have landmark positions based on a previous game at that stadium, but the position of the landmarks may have been subtly altered in the intervening time. The code reference, however, remains good enough for search and tracking most of the time. Alternatively, by waiting for a shot, or having the director set one up prior to the game, in which all the landmarks are clear of obstruction, and allowing the adjustment of the tracking reference to be completed, a more accurate game reference for that particular game can be achieved.

At any time, in either the tracking or search mode, the operator can elect to reset the game reference to the code reference. This allows recovery from operator error in resetting the game reference to a corrupted tracking reference.
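
The management of the three reference arrays can be summarized in a short sketch. The class and method names below are assumptions; only the copy and reset behavior follows the description above.

    # Illustrative sketch of the CODE / GAME / TRACKING reference hierarchy
    # described above. The class and method names are assumptions; only the
    # copy/reset behavior follows the text.

    class ReferenceArrays:
        def __init__(self, code_reference):
            # At start up all three references are set to the code reference.
            self.code = list(code_reference)
            self.game = list(code_reference)
            self.tracking = list(code_reference)

        def start_tracking(self):
            # On the first tracked field (and after any scene cut) the tracking
            # reference is reset to the game reference.
            self.tracking = list(self.game)

        def operator_save_game_reference(self):
            # Operator copies the current tracking reference into the game
            # reference, e.g. after an unobstructed shot before the game.
            self.game = list(self.tracking)

        def operator_reset_game_reference(self):
            # Operator recovers from a corrupted game reference by resetting it
            # back to the code reference.
            self.game = list(self.code)

    refs = ReferenceArrays([(100.0, 50.0), (240.0, 50.0)])
    refs.start_tracking()
    refs.tracking[0] = (100.2, 49.9)       # adaptive update during tracking
    refs.operator_save_game_reference()    # keep the refined coordinates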

An important part of the adaptive reference process is restricting the updating to landmarks which are known to be un-occluded by objects such as players. The method used for this landmark occlusion detection in the preferred embodiment is color based and takes advantage of the fact that most sports are played on surfaces which have well defined areas of fairly uniform color, or in stadiums which have substantial features of uniform color, such as the back wall in a baseball stadium. Each landmark 90, as shown in FIG. 9, has sensor points 92 associated with it. These sensor points 92, which in the preferred embodiment vary from 3 to 9 sensor points per landmark 90, are pixels in predetermined locations close to, or preferably surrounding, the landmark they are associated with. More importantly, the sensor points are all on areas of reasonably uniform color. The decision on whether the landmarks are occluded or not is based on looking at the sensor points and measuring their deviation from an average value. If this deviation exceeds a pre-set value, the landmark is presumed to be occluded. Otherwise it is available for use in other calculations, such as the model calculations and the reference array updating.
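
A minimal sketch of the color based occlusion test follows. It assumes each sensor point carries a current color sample and a running reference color, flags the landmark as occluded when the average deviation exceeds a pre-set value, and otherwise blends the samples into the reference colors so the expected colors can follow lighting changes. The use of RGB triples, the blending factor and the threshold value are illustrative assumptions.

    # Illustrative color based occlusion check for one landmark. Each sensor
    # point has a reference color and a current sample (RGB triples here, as an
    # assumption). If the mean deviation exceeds a pre-set threshold the landmark
    # is treated as occluded; otherwise the reference colors are updated slowly
    # to follow lighting changes (e.g. the shift from sunlight to floodlights).

    def check_occlusion(reference_colors, sample_colors, threshold, blend=0.05):
        deviation = 0.0
        for ref, cur in zip(reference_colors, sample_colors):
            deviation += sum(abs(r - c) for r, c in zip(ref, cur)) / 3.0
        deviation /= len(reference_colors)

        if deviation > threshold:
            return True, reference_colors          # occluded: do not update

        updated = [tuple((1 - blend) * r + blend * c for r, c in zip(ref, cur))
                   for ref, cur in zip(reference_colors, sample_colors)]
        return False, updated                       # un-occluded: adapt colors

    refs = [(40.0, 120.0, 60.0)] * 5                # five greenish sensor points
    grass = [(42.0, 118.0, 61.0)] * 5               # similar colors: not occluded
    jersey = [(200.0, 30.0, 30.0)] * 5              # a red uniform: occluded
    print(check_occlusion(refs, grass, threshold=25.0)[0])   # False
    print(check_occlusion(refs, jersey, threshold=25.0)[0])  # True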

The discussion up until this point has described the LVIS search/detect and track features of co-pending application Ser. No. 08/580,892 filed Dec. 29, 1995 entitled "METHOD OF TRACKING SCENE MOTION FOR LIVE VIDEO INSERTION SYSTEMS".

The concept of the present invention is to augment the velocity prediction scheme of a standard LVIS with camera sensor data. While such action may sound trivial, it is in fact a complex undertaking that requires synchronicity between different data formats. Camera sensor data provides a "snap-shot" of a complete image field which can be reduced to a two-dimensional image coordinate array where the entire image array is mapped all at once, i.e. at a single instant in time. That is to say, the pixels on the left side of the array represent the same instant in time as the pixels on the right side of the array. Motion tracking using a standard LVIS technique, however, is a continually updating process with respect to the image array coordinates. Thus, at any given instant, the pixels on the left side of an image array do not represent the same instant in time as the pixels on the right side of the image array. For the hybrid system of the present invention to perform seamlessly, such anomalies must be accounted and compensated for.

Referring to FIG. 10, there is a camera 110 having lens 112 mounted on a tripod mount 111, set up to record a tennis match on a tennis court 115. The camera 110 and lens 112 are fitted with a set of sensors 113 designed to measure the pan, tilt, zoom and focus of the lens 112 and camera 110. Sensors 113 also determine whether double magnification optics are being used. Broadcast cameras usually have a "doubler" element, which can be switched in or out of the lens' train of optical elements at the turn of a knob. Use of this doubler effectively doubles the image magnification at any given setting of the lens' zoom-element. This means that a single reading of Z (the counts from the zoom-element driver) is associated with two different values of zoom or image magnification. Data gatherer 114 receives data from camera sensors 113 before feeding same to a Live Video Insertion System (LVIS) 118 having a data interpreter 116. Data interpreter 116 converts data forwarded by data gatherer 114 into a form that can be used by the LVIS system. Other similar cameras with sensors are positioned throughout the event site for recording different views of the action.

FIG. 10 also shows some of the usual broadcast equipment, such as a switcher 120, used in a television production. A switcher allows the director to choose among several video sources as the one currently being broadcast. Examples of other video sources shown in FIG. 10 include additional cameras 110 or video storage devices 122. Switcher 120 may also include an effects machine 124 such as a digital video effects machine. This allows the director to transition from one video feed to another via warpers or other image manipulation devices. Warpers are image manipulation devices that translate an image from one perspective to another, such as, for instance, a change in zoom, pan, or tilt.

The program feed is next sent to an LVIS 118. In addition to the search/detection, i.e. recognition, and tracking abilities of a typical live video insertion system, the LVIS 118 of the preferred embodiment of the present invention further includes a data interpreter 116. Data interpreter 116 interprets camera sensor data from data gatherer 114 and tally information received from switcher 120, thereby informing LVIS 118 which video source is currently being broadcast. LVIS 118 is further equipped with a software and hardware decision module 126. Decision module 126 allows LVIS 118 to use sensor data in place of traditional search mode data obtained via the pattern recognition techniques previously described. Decision module 126 can switch between a conventional pattern recognition tracking mode or a mode where tracking is done via a combination of camera sensor data and pattern recognition.

Once the video has passed through LVIS 118, an indicia 136 is seamlessly and realistically inserted in the video stream. The insertion may be static, animated, or a live video feed from a separate video source 128. The resultant video signal is then sent via a suitable means 130, which may be satellite, aerial broadcast, or cable, to a home receiver 132 where the scene 135 with inserted indicia 136 is displayed on a conventional television set 134.

Referring now to FIG. 13, the set of sensors that determine the pan and tilt of camera 110 comprise precision potentiometers or optical encoders designed to measure the rotation about the horizontal 146 and vertical 142 axes. Similar sensors also determine the focus and zoom of lens 112 by measuring the translation of optical elements within lens 112. Focus and zoom motion are determined by measuring the rotation of the shafts that move the optical elements that define focus and zoom. This is done by measuring the rotation about axis 150 of handle 148 used by the camera operator to change zoom, and about axis 154 of handle 152 used by the camera operator to effect changes in focus.

Data from pan sensor 140, tilt sensor 144, zoom sensor 149, and focus sensor 153 are collected by data gatherer 114. Data gatherer 114 then takes the raw voltages and/or sensor pulses generated by the various sensors and converts them into a series of numbers in a format that can be transmitted to data interpreter 116 of LVIS 118. Data interpreter 116 may be located remotely or on-site. Data gatherer 114 may take the form of a personal computer equipped with the appropriate communications and processing cards, such as standard analog-to-digital (A/D) convertor cards and serial and parallel communications ports.

For potentiometer data, such as zoom sensor 149 and focus sensor 153, data gatherer 114 converts an analog voltage, typically in the range -3 to +3 volts, into a digital signal which is a series of numbers representing the position of the lens. These numbers may be gathered at some predetermined data rate, such as once per video field or once every 6 milliseconds, and forwarded to data interpreter 116 of LVIS 118. Or, LVIS 118 may send a request to data gatherer 114 requesting an update on one or more of the parameters being used.

Data from a typical optical encoder is in three tracks as illustrated in FIG. 14. Each track consists of a series of binary pulses. Tracks A and B are identical but are a quarter period out of phase with one another. A period is the combination of a low and a high pulse. In a typical optical encoder one rotation of the sensor device through 360 degrees will result in approximately 40,000 counts, where a count is each time the encoder output goes from 0 to +1 or from +1 to 0. The reason for having two data tracks a quarter period out of phase is to inform data interpreter 116 which direction the sensor is being rotated. As illustrated in FIG. 15, if track A is making a transition then the state of track B determines whether the sensor is being rotated clockwise or counter-clockwise. For instance, if track A is making a transition from a high state to a low state and if track B is in a high state then the sensor is rotating clockwise. Conversely, if track B is in a low state the sensor is rotating counter-clockwise.

By studying tracks A and B, data gatherer 114 can monitor sensor position simply by adding or subtracting counts as necessary. All that is needed is a reference point from which to start counting. The reference point is provided by track C. Track C has only two states, +1 or 0. This effectively defines a 0 degree point and a 180 degree point. Since in a practical, fixed camera setup the arc through which the camera is rotated is less than 180 degrees, we need only consider the zero setting case.

By monitoring track C transitions, data gatherer 114 is able to set the rotation counters to zero and then increment or decrement the counters by continuously monitoring tracks A and B. At suitable intervals, such as once per field or once every 6 milliseconds, the rotation position of the optical sensor can be forwarded to data interpreter 116. Alternately, at any time, LVIS 118 may send a request to data gatherer 114 for a current measurement of one or more of the parameters being monitored.
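
The count tracking described in the last few paragraphs can be sketched as a small decoder. The sketch below is illustrative only: it watches transitions on track A, uses the state of track B to decide whether to increment or decrement, and zeroes the counter on a track C transition, following the relationship shown in FIG. 15. The sign convention (clockwise increments) and the handling of the opposite edge of track A are assumptions.

    # Illustrative quadrature decoder for the optical encoder tracks described
    # above. On every transition of track A, the state of track B gives the
    # rotation direction; a transition on track C zeroes the counter.

    class QuadratureDecoder:
        def __init__(self):
            self.count = 0
            self.prev_a = 0
            self.prev_c = 0

        def update(self, a, b, c):
            if c != self.prev_c:
                self.count = 0                 # track C transition: zero reference
            elif a != self.prev_a:
                if a == 0:                     # A went high -> low
                    clockwise = (b == 1)       # B high means clockwise (FIG. 15)
                else:                          # A went low -> high
                    clockwise = (b == 0)       # direction reverses on this edge
                self.count += 1 if clockwise else -1
            self.prev_a, self.prev_c = a, c
            return self.count

    decoder = QuadratureDecoder()
    # Clockwise rotation: track A leads track B by a quarter period.
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0), (1, 0), (1, 1), (0, 1)]:
        decoder.update(a, b, c=0)
    print(decoder.count)   # positive count for clockwise rotation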

The function of data interpreter 116 is to convert the digitized position and/or rotational information from data gatherer 114 into a format compatible with and usable by a typical LVIS tracking system. Referring to FIG. 16, sensor data from the camera and lens is made compatible with the LVIS tracking system by means of a common reference image.

The common reference image is a stored image that allows for mathematical modeling or translation between a conventional LVIS tracking system, such as that described in commonly owned application Ser. No. 08/580,892, entitled "METHOD OF TRACKING SCENE MOTION FOR LIVE VIDEO INSERTION SYSTEMS", and a system relying exclusively on camera sensor data. Typically, the common reference image is modeled upon the chosen tracking method, i.e. adaptive geographical hierarchical or texture analysis for instance, and the camera sensor data is translated to that chosen tracking model.

There are several important aspects to the common reference image. First is the origin. The origin is chosen as the point at which the camera lens optical axis goes through the common reference image. This is typically not the center of the video image for two reasons. First, there may be a slight misalignment between the axis of the zoom elements of the lens and the optical axis of the main lens components. Second, the CCD array of the camera may not be exactly perpendicular to the optical axis of the lens.

This offset can be handled one of two ways. First, a zoom dependent skew parameter can be added to the interpretation of the data. Or, second, a zero point within the common reference image can be defined at the point where the camera lens optical axis crosses the common reference image. The zero point can be determined in practice in a number of ways. The preferred method first sets up a cross hair on the image at the center of the image. Second, zoom in on a fiducial point. A fiducial point is a fixed or reference point. Next, pan and tilt the camera until the cross-hair is centered on the fiducial point. Then zoom out as far as possible. Now move the cross hair on the image until it is centered again on the fiducial point. Lastly, repeat the second and third steps until the cross-hair stays centered on the fiducial point as the camera is zoomed in and out. The x, y coordinates of the fiducial point and of the cross-hair are now the (0,0) points of the common reference image, i.e. the origin.

The common reference image shown in FIG. 16 is an image of a stadium or event taken at some intermediate zoom with a known setting of the camera parameters pan, tilt, zoom, and focus. The common reference image is a convenience for the operator. For convenience, we make the following definitions: P=Pan counts (the number that pan sensor 140 is feeding to the data interpreter); T=Tilt counts (the number that tilt sensor 144 is feeding to the data interpreter); Z=Zoom counts (the number that zoom sensor 149 is feeding to the data interpreter); and F=Focus counts (the number that focus sensor 153 is feeding to the data interpreter). Camera sensor readings are also recorded contemporaneously with the common reference image and are given the following designations: Z₀ =Z at the taking of the common reference image; F₀ =F at the taking of the common reference image; T₀ =T at the taking of the common reference image; P₀ =P at the taking of the common reference image; and (X₀,Y₀) are the coordinates in the common reference image of the (0,0) point defined above.

Three calibration constants are required to translate the camera sensor data into a form usable by a conventional LVIS image tracking system. These constants are: xp, the number of x pixels moved per count of the pan sensor at Z₀, F₀; yt, the number of y pixels moved per count of the tilt sensor at Z₀, F₀; and zf, the Z count equivalent of one F count at Z₀. xp and yt are related by a simple constant but have been identified separately for the sake of clarity.

FIG. 17 is a linear plot of Z, the counts from the zoom counter, along the x-axis versus the zoom along the y-axis. The zoom at the common reference image settings is the unit zoom. As can be seen from the dotted lines, a side effect of adjusting the camera focus element is an alteration in the image magnification or zoom. The nature of the alteration is very similar to the nature of the alteration in image magnification produced by zoom adjustment. However, the change in image magnification (zoom) brought about by adjusting the focus element through its entire range is significantly smaller than the change in image magnification brought about by adjusting the camera zoom element through its entire range.

This can be understood graphically by considering two sets of plots. First, a graph is made of Image Magnification (Zoom) vs. the adjustment of the zoom elements of the lens (as measured by counting the number of rotations, Z, of the screw shaft moving the zoom elements in the zoom lens), with the focus element of the zoom lens kept at a fixed setting. This first plot is called the Magnification vs. Zoom plot.

Second, a number of graphs are made of Image Magnification vs. the adjustment of the focus element of the lens (as measured by counting the number of rotations, F, of the screw shaft moving the focus elements in the zoom lens) at a number of distinct settings of Z, the position of the zoom element. These graphs are called the Magnification vs. Focus plots.

The Magnification vs. Focus plots can then be overlaid onto the Magnification vs. Zoom plot. By compressing the focus axis of the Magnification vs. Focus plots, the shape of the Magnification vs. Focus curve can be made to match the local curvature of the Magnification vs. Zoom plot, as shown in FIG. 17.

The important point is that the degree of compression of the Focus axis necessary to make the Focus curves match the Zoom curve is the same for each of the Magnification vs. Focus curves, despite their being made at different, fixed values of Z. This means that it is possible to simplify the mathematics of the interaction of zoom and focus on the image size by treating zoom and focus adjustments in a similar fashion. In particular, in determining image size or magnification, it is possible to interpret the data from the focus sensor (the counter measuring the position of the focus element) as being equivalent to data from the zoom sensor (the counter measuring the position of the zoom element). All that is needed to make the Zoom and Focus data equivalent is a simple modification of the Focus data by a single offset value and a single multiplication factor. Equivalent zoom counts are defined by:

    Z_EC = zf (F - F₀)

zf is a calibration constant determined by plotting zoom against Z counts, and then overlaying the zoom against F counts at particular zooms. By adjusting the F counts so that the zoom from the focus fits the zoom curve, the constant zf can be found. The same thing can be done analytically by first determining the relationship between zoom and Z counts, and then using that relationship to fit zoom to F counts by adjusting zf.
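As one way to picture the analytic fit of zf, the sketch below assumes a zoom-versus-Z-counts relationship has already been established (the callable zoom_from_z) and searches for the zf that best maps focus counts onto equivalent zoom counts in a least squares sense. The function names, the focus-sweep measurement inputs, and the search range for zf are all illustrative assumptions.

    import numpy as np

    def fit_zf(zoom_from_z, Z_fixed, F0, f_counts, measured_zoom):
        """Find zf so that zoom_from_z(Z_fixed + zf*(F - F0)) best matches the
        magnification measured while only the focus element was moved."""
        f_counts = np.asarray(f_counts, dtype=float)
        measured_zoom = np.asarray(measured_zoom, dtype=float)

        def residual(zf):
            z_equiv = Z_fixed + zf * (f_counts - F0)   # equivalent zoom counts
            return np.sum((zoom_from_z(z_equiv) - measured_zoom) ** 2)

        # coarse one-dimensional search over an assumed range of plausible zf values
        candidates = np.linspace(-1.0, 1.0, 20001)
        errors = np.array([residual(zf) for zf in candidates])
        return candidates[np.argmin(errors)]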

In the preferred embodiment, zoom was fitted to Z with the following exponential function, using a least squares fit:

    zoom = e^(aZ² + bZ + c)

There may also be a lookup table to convert the raw zoom counts into zoom, or a combination of a lookup table and a mathematical interpolation which may be similar to the expression in the equation above.
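Since log(zoom) is quadratic in the counts, the exponential model above can be fitted with an ordinary polynomial least squares fit in log space. The short sketch below is illustrative only; the measurement arrays are assumed inputs.

    import numpy as np

    def fit_zoom_model(z_counts, measured_zoom):
        """Fit zoom = exp(a*Z**2 + b*Z + c) by a least squares fit of log(zoom)."""
        a, b, c = np.polyfit(np.asarray(z_counts, dtype=float),
                             np.log(np.asarray(measured_zoom, dtype=float)), 2)
        return a, b, c

    def zoom_from_counts(counts, a, b, c):
        """Evaluate the fitted exponential model at a zoom (or combined) count."""
        return np.exp(a * counts**2 + b * counts + c)

A lookup table built by sampling zoom_from_counts at regular count values, with interpolation between entries, would serve the same purpose as the analytic expression.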

Calibration constants xp and yt are measured by pointing the camera at one or more points in the common reference image, i.e. centering the cross-hair marking the optical axis of the lens on each point, and recording the P and T values. By measuring the pixel distance in the common reference image between the selected points and the (0,0) point, calibration constants xp and yt are calculated by means of the following two equations:

    xp = (X - X₀)/(P - P₀)

    yt = (Y - Y₀)/(T - T₀)
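A minimal sketch of this calibration step follows, assuming the operator has centered the cross-hair on a chosen reference point and recorded the pan and tilt counts; X and Y are the point's pixel coordinates in the common reference image, and the function name is illustrative.

    def calibrate_pan_tilt_scale(X, Y, P, T, X0, Y0, P0, T0):
        """Pixels moved per pan count (xp) and per tilt count (yt) at Z0, F0."""
        xp = (X - X0) / (P - P0)
        yt = (Y - Y0) / (T - T0)
        return xp, yt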

Constants xp, yt, zf, a, b and c are used, together with the reference constants Z₀, F₀, P₀ and T₀, by the image tracking software to calculate the position of a point in the current image whose location is known with respect to the reference array of the common reference image.

In the simplest affine representation, ignoring rotation and assuming the zoom is the same in the x and y directions, the position of an object in the current image can be related to its position in the common reference image by the equations:

    x_i = Z x_r + t_x

    y_i = Z y_r + t_y

where x_i and y_i are the x and y positions of an object in the current image, x_r and y_r are the x and y positions of the same object in the common reference image, Z is the zoom between the current image and the common reference image, and t_x and t_y are the x and y translations between the current image and the common reference image. In the conventional LVIS tracking equations, Z, t_x and t_y are solved for by measuring the positions of a set of known landmarks, using a weighted least squares fit. Having found Z, t_x and t_y, any other point in the common reference image can then be mapped into the current image using the equations for x_i and y_i.
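The mapping above is simple enough to state directly in code; the following sketch maps a point from the common reference image into the current image once Z, t_x and t_y are known (zoom-plus-translation model only, no rotation).

    def reference_to_current(x_r, y_r, Z, t_x, t_y):
        """Map a common-reference-image point into the current image."""
        x_i = Z * x_r + t_x
        y_i = Z * y_r + t_y
        return x_i, y_i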

From the equations above it can be seen that Z is simply the exponential zoom fit evaluated at the combined count, Z = e^(aμ² + bμ + c), the fit being normalized so that the common reference image has unit zoom, where μ is the combined zoom and focus count as defined by:

    μ = Z + zf (F - F₀)

t_x and t_y are found from the camera sensors using the relationships:

    t_x = xp (P - P₀)

    t_y = yt (T - T₀)

In the preferred embodiment, data interpretation unit 116 is either a software implementation, a hardware implementation, or a combination of software and hardware implementing the equations that convert sensor data P, T, Z and F into Z, t_x and t_y, the unit having been calibrated by defining P₀, T₀, Z₀, F₀, X₀, Y₀, zf, xp and yt.
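A minimal software sketch of this conversion is given below, assuming the calibration constants and the fitted a, b, c have already been determined as described above. Dividing by the fit evaluated at Z₀ simply enforces the unit-zoom convention for the common reference image; the function name and argument layout are illustrative.

    import math

    def interpret_sensors(P, T, Zc, Fc, P0, T0, Z0, F0, a, b, c, xp, yt, zf):
        """Convert raw sensor counts into the affine parameters Z, t_x, t_y."""
        mu = Zc + zf * (Fc - F0)                  # combined zoom/focus count
        zoom = math.exp(a * mu**2 + b * mu + c)
        zoom_ref = math.exp(a * Z0**2 + b * Z0 + c)
        Z = zoom / zoom_ref                       # unit zoom at the common reference image
        t_x = xp * (P - P0)
        t_y = yt * (T - T0)
        return Z, t_x, t_y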

The x and y position of a point can be expressed directly in terms of P₀, T₀, Z₀, F₀, X₀, Y₀, zf, xp and yt by:

    x_i = x_r Z + xp (P - P₀)

    y_i = y_r Z + yt (T - T₀)

Whichever implementation is used, the implementation in hardware or software may be by the analytic expressions detailed above, by lookup tables which express or approximate those expressions or the experimental data from which the expressions were derived, or by a combination of lookup tables, analytic expressions and experimental data.

The LVIS can now use the translated camera sensor data in a number of ways. Whichever method is used, however, it must compensate for lens distortion of the particular lens being used.

One method for using the translated camera data is to use the Z, t_x and t_y affine conversion for search only, and then switch to conventional tracking. This means that lens distortion can be compensated for conventionally by having a deformable common reference image, as described in detail in commonly owned co-pending applications Ser. Nos. 08/563,598 and 08/580,892 entitled "SYSTEM AND METHOD FOR INSERTING STATIC AND DYNAMIC IMAGES INTO A LIVE VIDEO BROADCAST" and "METHOD OF TRACKING SCENE MOTION FOR LIVE VIDEO INSERTION SYSTEMS" respectively.

A second application is to use the translated camera data to supplement the tracking capability of the system by using the Z, t_x and t_y affine conversion to create one or more image-centric landmarks, which are always visible but which have a weighting factor that always corresponds to an error of about 2 pixels, and then feed these extra landmarks into a matrix based landmark tracking system as explained in detail in co-pending patent application Ser. No. 08/580,892 filed Dec. 29, 1995 entitled "METHOD OF TRACKING SCENE MOTION FOR LIVE VIDEO INSERTION SYSTEMS". The flexible common reference image would have to be extended to include flexible camera reference parameters.
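One way to picture this second approach is sketched below: a few always-visible "virtual" landmarks are synthesized from the sensor-derived affine model and handed to the landmark tracker with a fixed, low weight. The inverse-variance weighting convention and the assumed error of about 2 pixels are illustrative choices, not prescribed by the specification.

    def make_virtual_landmarks(reference_points, Z, t_x, t_y, assumed_error_px=2.0):
        """Create sensor-derived landmarks for a matrix based landmark tracker.

        reference_points: list of (x_r, y_r) positions in the common reference image.
        Returns (x_i, y_i, weight) triples; the low weight reflects the ~2 pixel
        error assumed for camera sensor data, so optically tracked landmarks dominate."""
        weight = 1.0 / (assumed_error_px ** 2)
        return [(Z * x_r + t_x, Z * y_r + t_y, weight) for (x_r, y_r) in reference_points]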

A third method for using the translated camera data is to supplement the tracking capability of the system by using the Z, t_x and t_y affine conversion to predict, or as part of the prediction of, where optical tracking landmarks should be in the current image, and then use landmark or texture tracking to improve whatever model is being used to relate the current image to the reference array, to the extent that recognizable structure is available. Texture tracking is described in co-pending provisional application Ser. No. 60/031,883 filed Nov. 27, 1996 entitled "CAMERA TRACKING USING PERSISTANT, SELECTED, IMAGE TEXTURE TEMPLATES". This approach can be used for any model representation, including full affine and perspective. Distortion compensation is more difficult, especially if the supplementation is to be modular, i.e. available on, for instance, the zoom, x offset (or horizontal translation) and y offset (or vertical translation) separately and in any combination thereof. One robust way is to have a function or lookup table that maps the distortion.
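As a rough sketch of the lookup-table idea, a predicted landmark position from the affine model could be corrected by a precomputed distortion map before the local landmark or texture search. The grid layout, the nearest-cell lookup, and the parameter names below are assumptions made for brevity; a practical system might interpolate between map entries.

    def correct_for_distortion(x, y, dx_map, dy_map, grid_step):
        """Correct a predicted (x, y) position using a precomputed distortion map.

        dx_map / dy_map: 2-D tables giving the measured offset (ideal minus
        distorted) sampled every grid_step pixels across the image."""
        i = int(round(y / grid_step))
        j = int(round(x / grid_step))
        i = min(max(i, 0), len(dx_map) - 1)       # clamp to the table bounds
        j = min(max(j, 0), len(dx_map[0]) - 1)
        return x + dx_map[i][j], y + dy_map[i][j]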

Having determined the model relating the current image to the common reference image, the remainder of the LVIS, including insertion and occlusion processing, can be used normally, as described in detail in co-pending patent application Ser. No. 08/662,089 entitled "SYSTEM AND METHOD OF REAL-TIME INSERTIONS INTO VIDEO USING ADAPTIVE OCCLUSION WITH A SYNTHETIC COMMON REFERENCE IMAGE".

In an alternative embodiment of the invention, illustrated in part in FIG. 18, in addition to the pan, tilt, zoom and focus sensors 113 already described, there are two additional sensors 160 and 164 fitted in the transition module by which the camera 110 and lens 112 are attached to the tripod mount 111. These additional sensors 160 and 164 are accelerometers which measure acceleration in two orthogonal directions 162 and 166. The data from the accelerometers is fed to the data gathering unit 114, where it is integrated twice with respect to time to provide the current displacement of the camera in the x and y directions. The displacement data is fed to data interpreting unit 116, where it is multiplied by a previously determined calibration constant and added to the t_x and t_y components of the translated affine transform, or multiplied by a related but different calibration constant and added directly to the pan and tilt counts respectively for use in the direct conversion into image coordinates.
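A minimal sketch of the double integration step follows; the sampling interval, the rectangular integration rule, and the single scale constant are illustrative assumptions (a real system would also have to manage accelerometer drift).

    def integrate_displacement(accel_samples, dt, scale=1.0):
        """Twice-integrate accelerometer samples to estimate camera displacement.

        accel_samples: acceleration readings taken at a fixed interval dt (seconds).
        scale: a previously determined calibration constant converting the
        integrated value into pixels (or counts)."""
        velocity = 0.0
        displacement = 0.0
        for a in accel_samples:
            velocity += a * dt
            displacement += velocity * dt
        return scale * displacement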

In a simplified version of this alternative embodiment, only the accelerometer 160 measuring acceleration in the vertical direction is added to the pan, tilt, zoom and focus sensors 113, since the most common problem with supposedly stationary cameras is that they are mounted on unstable platforms, making vertical shift the dominant disturbance.

In a modification of the simplified version of the alternative embodiment, a second accelerometer 163 is fitted at the front of the lens 112 so that camera compliance or oscillation in the vertical direction, independent of tilt about the axis 146, can also be measured and made use of in ascertaining the direction in which the camera 110 and lens 112 are pointing at any given time.

In another alternative embodiment of the invention, illustrated in FIG. 19, zoom and focus sensors 149 and 153 fitted to lens 112 are the same as in the preferred embodiment, but tilt and pan sensors 140 and 144 are changed, and there is an additional rotational sensor 174 and an additional Radio Frequency (RF) or Infra Red (IR) transmitter 170 attached. The tilt sensor 144 is a plumb bob potentiometer, measuring tilt from the normal to the local, gravitationally defined surface of the earth. The rotational sensor 174 is also a plumb bob potentiometer, or an optically encoded sensor with a gravity sensitive zero indicator, designed to measure the rotation of the camera around the axis 176. The pan sensor 140 is a sensitive electronic compass measuring the horizontal rotation away from a local magnetic axis, which may for instance be local magnetic north. The RF or IR transmitter 170 puts out suitably shaped pulses at predetermined, precisely timed intervals, which are picked up by two or more receivers 172 located in suitable positions in the stadium. By measuring the difference in the arrival time of the pulses at the receivers 172, the location of the camera in the stadium can be calculated to within a few millimeters. The data from the receivers 172 and the camera sensors 140, 144, 149 and 153 is then fed to data interpreter 116 in the LVIS system. By combining the data, the system can calculate the position and orientation of the camera 110, as well as the focus and zoom of the lens 112. In this way a hand-held or mobile camera can be accommodated. In the affine model representation, the earlier equations are extended to include cross terms to deal with the rotation, e.g.

    x_i = Z x_r + β y_r + t_x

    y_i = Z y_r + β x_r + t_y

where β is a transformation constant to account for the extra rotational degree of freedom allowed by a hand-held camera.

In another alternative embodiment of the invention, illustrated in FIG. 20, the system can handle both hand-held or mobile cameras and can determine the position of objects of interest to the sport being played. For instance, in a tennis match being played on court 15, the ball 80 could have a transmitter concealed in it, which may be a simple Radio Frequency (RF) or Infra Red (IR) transmitter emitting suitably shaped pulses at predetermined, precisely timed intervals, differentiated from transmitter 170 attached to mobile camera 110 by timing, frequency, pulse shape or other suitable means. The receivers 172, located in suitable positions in the stadium, now measure the differences in the arrival times of the pulses emitted by both the camera transmitter 170 and the object transmitter 180. The system is now able to locate the instantaneous position of both the camera 110 and the ball with transmitter 180. The data from the camera 110 and the receivers are fed to the data gatherer 114 and then to the data interpreter 116. The data interpreter 116 can now infer the location, orientation, zoom and focus of the camera 110 and lens 112, which can, as described in detail previously, provide search information to the LVIS system and may also be used to advantage in the track mode of the LVIS system. Furthermore, the data interpreter 116 can also provide information about the location of an object of interest 180 in the current image, which may be used, for instance, to provide viewer enhancements such as a graphic 84 on the final output showing the trajectory 182 of the object of interest.
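For illustration only, the sketch below shows one simple way the arrival-time differences could be turned into a position estimate for either transmitter: candidate positions are scored by how well their predicted arrival-time differences match the measured ones. The coarse grid search, the function names, and the assumption of an RF (speed-of-light) transmitter are all illustrative; a practical system would refine the estimate with a proper multilateration solver.

    import math

    SPEED_OF_LIGHT = 299_792_458.0  # m/s, assuming an RF transmitter

    def locate_transmitter(receivers, arrival_times, candidates, c=SPEED_OF_LIGHT):
        """Estimate a transmitter position from pulse arrival times at known receivers.

        receivers:     list of (x, y, z) receiver positions in metres.
        arrival_times: pulse arrival time at each receiver (seconds); only the
                       differences between receivers matter.
        candidates:    iterable of candidate (x, y, z) positions to test."""
        ref_rx = receivers[0]
        meas_diffs = [t - arrival_times[0] for t in arrival_times[1:]]

        def mismatch(p):
            d_ref = math.dist(p, ref_rx)
            err = 0.0
            for rx, dt in zip(receivers[1:], meas_diffs):
                pred_dt = (math.dist(p, rx) - d_ref) / c
                err += (pred_dt - dt) ** 2
            return err

        return min(candidates, key=mismatch)

The candidate set could be as simple as a regular grid of points spanning the court or stadium volume, evaluated once per pulse.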

It is to be understood that the apparatus and method of operation taught herein are illustrative of the invention. Modifications may readily be devised by those skilled in the art without departing from the spirit or scope of the invention.

What is claimed is:
 1. A method for tracking motion from field to field in a sequence of related video images that are scanned by at least one camera having one or more hardware sensor devices, the method comprising the steps of: a) establishing an array of idealized x and y coordinates representing a reference array having a plurality of landmarks where each landmark has unique x and y coordinates; b) mapping x and y coordinates in a current image to said x and y coordinates in said reference array; c) acquiring camera sensor data from said hardware sensor device, said camera sensor data representing the position and orientation of the camera; d) predicting the future location of said landmark coordinates, x' and y', using said camera sensor data, wherein prediction errors due to changes between two successive fields are minimized by adding (i) the field to field difference in landmark location calculated from said camera sensor data to (ii) the landmark position x, y previously located.
 2. The method of claim 1 wherein said mapping is achieved according to the following relationships:

    x' = a + bx + cy

    y' = d + ex + fy

where: x is a horizontal coordinate in the reference array, y is a vertical coordinate in the reference array, x' is a horizontal coordinate in the current scene, y' is a vertical coordinate in the current scene, a is a warp parameter for horizontal translation of the object in the x direction, b is a warp parameter for magnification between the reference array and the current image in the x direction, c is a warp parameter for a combination of rotation and skew in the x direction, d is a warp parameter for vertical translation of the object in the y direction, e is a warp parameter for a combination of rotation and skew in the y direction, and f is a warp parameter for magnification between the reference array and the current image in the y direction.
 3. The method of claim 2 wherein said video images are vertically interlaced where images from field to field alternate between like and unlike fields.
 4. The method of claim 3 wherein said predicting the future location of said landmark coordinates, x' and y', for said interlaced video images is based on a detected change of position of said landmark from the previous like field.
 5. The method of claim 4 further comprising the steps of: e) searching for one of said landmarks in said current image by means of correlation using a template where the search is conducted over a substantial region spanning the predicted location of said landmark; f) multiplying the results of said correlation search in step (e) by a weighting function giving greater weight to correlations closer in distance to the predicted location of said landmark to yield a weighted correlation surface; g) searching said weighted correlation surface for its peak value.
 6. The method of claim 5 further comprising the steps of: h) determining new warp parameters a, b, c, d, e and f for a current image based on said landmark's current position in a current image weighted by said weighting function for that landmark, wherein emphasis is given to landmarks which are closer to their predicted position.
 7. The method of claim 6 wherein said weighting function comprises the following relationship: ##EQU6## where: g, h, i, j, k, and l are numerical constants; xp is the predicted x coordinate location of said landmark; xm is the measured x coordinate position of said landmark; yp is the predicted y coordinate location of said landmark; and ym is the measured y coordinate position of said landmark.
 8. The method of claim 7 further including the step of: i) updating said landmark locations in said reference array according to the location of said landmarks in said current image, wherein said updating is performed based upon well identified landmarks and according to said landmark weighting function.
 9. The method of claim 8 further comprising the step of: j) establishing three types of reference arrays prior to broadcast including: i) a code reference array having landmark coordinates equal to said reference landmark coordinates, ii) a game reference array having landmark coordinates initially set equal to said code reference array coordinates, and iii) a tracking reference array having landmark coordinates initially set equal to said code reference array coordinates.
 10. The method of claim 9 further comprising the steps of: k) changing said tracking reference array of coordinates during a broadcast; and l) resetting the tracking reference array of coordinates to said game reference array of coordinates after a scene cut.
 11. The method of claim 10 wherein said video system is controlled by an operator and said method further comprises the step of: m) selectively choosing to set said current tracking reference array of coordinates equal to said game reference array of coordinates or to set said game reference array of coordinates back to said code reference array of coordinates, wherein said operator can update or override the game or tracking reference array of coordinates.
 12. The method of claim 11 further comprising the steps of: n) establishing a set of sensor points in a pattern around the location of each said landmark, said sensor points being able to detect changes in color and illumination; o) determining if said sensor points are different in color or illumination from the expected color or illumination; and p) excluding said landmark from future calculations if said color or illumination is substantially different from what was expected, wherein said landmark is deemed to be occluded if said color or illumination at said sensor points is substantially different from the expected color or illumination.
 13. The method of claim 12 wherein said correlation template is a 15 by 15 pixel window.
 14. The method of claim 1 wherein said mapping is achieved according to the following relationships:

    x' = a + bx

    y' = d + by

where: x is a horizontal coordinate in the reference array, y is a vertical coordinate in the reference array, x' is a horizontal coordinate in the current scene, y' is a vertical coordinate in the current scene, b is a warp parameter for magnification between the reference array and the current image, a is a warp parameter for horizontal translation of the object in the x direction, and d is a warp parameter for vertical translation of the object in the y direction.
 15. The method of claim 4 further comprising the steps of: q) searching for one of said landmarks in said current image by means of correlation using a template where the starting point of the search is substantially centered at the predicted location of said landmark; r) performing said search beginning from said predicted location and proceeding outward looking for a match; and s) discontinuing said search for said landmark when said match exceeds a threshold value.
 16. The method of claim 6 wherein said weighting function comprises the following relationship: ##EQU7## where: xp is the predicted x coordinate location of said landmark; xm is the measured x coordinate position of said landmark; yp is the predicted y coordinate location of said landmark; and ym is the measured y coordinate position of said landmark.
 17. A method of merging a primary video stream into a secondary video stream so that the combined video stream appears to have a common origin from video field to video field even as the primary video stream is modulated by changes in camera orientation and settings, said apparent common origin achieved by using pattern recognition analysis of the primary video stream to stabilize and refine camera sensor data representing the orientation and settings of the primary video stream source camera, said method comprising the steps of: t) acquiring camera sensor data from at least one camera outfitted with hardware sensors which measure the orientation and settings of the camera, u) converting the camera sensor data to a format suitable for transmission, v) transmitting the converted camera sensor data to a live video insertion system, w) converting the camera sensor data to affine form, x) predicting where landmarks in the previous field of video will be in the current field of video based upon said camera sensor data, y) performing correlations to detect landmark positions centered about landmark positions predicted by the camera sensor data, and z) creating a model relating a reference field of video to the current field of video using a weighted least mean square fit for all located landmarks.
 18. The method of claim 17 wherein the orientation and settings of said at least one camera comprise focus, zoom, pan, and tilt.
 19. The method of claim 17 wherein the format suitable for transmission is a numeric series obtained by converting the acquired camera sensor data from an analog base to a digital base.
 20. A method of merging a primary video stream into a secondary video stream so that the combined video stream appears to have a common origin from video field to video field even as the primary video stream is modulated by changes in camera orientation and settings, said apparent common origin achieved by using pattern recognition analysis of the primary video stream to stabilize and refine camera sensor data representing the orientation and settings of the primary video stream source camera, said method comprising the steps of: aa) acquiring camera sensor data from at least one camera outfitted with hardware sensors which measure the orientation and settings of the camera, bb) converting the camera sensor data to a format suitable for transmission, cc) transmitting the converted camera sensor data to a live video insertion system, dd) converting the camera sensor data to affine form, ee) performing correlations to detect landmark positions centered about landmark positions predicted by the camera sensor data, ff) creating virtual landmarks using said camera sensor data, said virtual landmarks appropriately weighted for camera sensor data error, and gg) creating a model relating a reference field of video to the current field of video using a weighted least mean square fit for all located and virtual landmarks.
 21. The method of claim 20 wherein the orientation and settings of said at least one camera comprise focus, zoom, pan, and tilt.
 22. The method of claim 20 wherein the format suitable for transmission is a numeric series obtained by converting the acquired camera sensor data from an analog base to a digital base.
 23. A method for tracking motion from field to field in a sequence of related video images that are scanned by at least one camera having one or more hardware sensor devices, the method comprising the steps of: hh) obtaining a set of image templates from a current video image that meet certain template capturing criteria and storing said image templates in memory; ii) acquiring camera sensor data from said hardware sensing device, said camera sensor data representing the position and orientation of the camera; jj) using said camera sensor data in determining the position of each stored image template with respect to the current image; kk) calculating a transform model using the determined template position with respect to the current image, said transform model to be used to correspond reference position data to current image position data; ll) purging image templates from memory that do not meet certain template retention criteria; and mm) obtaining new image templates from said current image to replace the image templates that were purged.
 24. A method for tracking motion from field to field in a sequence of related video images that are scanned by at least one camera having hardware sensor devices, said hardware sensor devices to include an accelerometer, the method comprising the steps of: nn) establishing an array of idealized x and y coordinates representing a reference array having a plurality of landmarks where each landmark has unique x and y coordinates; oo) mapping x and y coordinates in a current image to said x and y coordinates in said reference array; pp) acquiring camera sensor data from said hardware sensor devices, said camera sensor data representing the position, orientation, and oscillation of the camera; qq) predicting the future location of said landmark coordinates, x' and y', using said camera sensor data, wherein prediction errors due to changes between two successive fields are minimized by adding (i) the field to field difference in landmark location calculated from said camera sensor data to (ii) the landmark position x, y previously located.
 25. A method of merging a primary video stream into a secondary video stream so that the combined video stream appears to have a common origin from video field to video field even as the primary video stream is modulated by changes in camera orientation and settings, said apparent common origin achieved by using pattern recognition analysis of the primary video stream to stabilize and refine camera sensor data representing the orientation and settings of the primary video stream source camera, said method comprising the steps of: rr) obtaining a set of image templates from a current video image that meet certain template capturing criteria and storing said image templates in memory, ss) acquiring camera sensor data from at least one camera outfitted with hardware sensors which measure the orientation and settings of the camera, tt) converting the camera sensor data to a format suitable for transmission, uu) transmitting the converted camera sensor data to a live video insertion system, vv) converting the camera sensor data to affine form, ww) predicting where image templates in the previous field of video will be in the current field of video based upon said camera sensor data, xx) performing correlations to detect image template positions centered about image template positions predicted by the camera sensor data, and yy) creating a model relating a reference field of video to the current field of video using a weighted least mean square fit for all image templates, zz) purging image templates from memory that do not meet certain template retention criteria, and aaa) obtaining new image templates from said current image to replace the image templates that were purged.
 26. A method of merging a primary video stream into a secondary video stream so that the combined video stream appears to have a common origin from video field to video field even as the primary video stream is modulated by camera oscillation and changes in camera orientation and settings, said apparent common origin achieved by using pattern recognition analysis of the primary video stream to stabilize and refine camera sensor data representing the motion, orientation and settings of the primary video stream source camera, said method comprising the steps of: bbb) acquiring camera sensor data from at least one camera outfitted with hardware sensors which measure the acceleration, orientation and settings of the camera, ccc) converting the camera sensor data to a format suitable for transmission, ddd) transmitting the converted camera sensor data to a live video insertion system, eee) converting the camera sensor data to affine form, fff) predicting where landmarks in the previous field of video will be in the current field of video based upon said camera sensor data, ggg) performing correlations to detect landmark positions centered about landmark positions predicted by the camera sensor data, and hhh) creating a model relating a reference field of video to the current field of video using a weighted least mean square fit for all located landmarks.
 27. A method of merging a primary video stream into a secondary video stream so that the combined video stream appears to have a common origin from video field to video field even as the primary video stream is modulated by changes in camera orientation and settings, said apparent common origin achieved by using pattern recognition analysis of the primary video stream to stabilize and refine camera sensor data representing the orientation and settings of the primary video stream source camera, said method comprising the steps of: iii) obtaining a set of image templates from a current video image that meet certain template capturing criteria and storing said image templates in memory, jjj) acquiring camera sensor data from at least one camera outfitted with hardware sensors which measure the orientation and settings of the camera, kkk) converting the camera sensor data to a format suitable for transmission, lll) transmitting the converted camera sensor data to a live video insertion system, mmm) converting the camera sensor data to affine form, nnn) performing correlations to detect image template positions centered about image template positions predicted by the camera sensor data, ooo) creating virtual image templates using said camera sensor data, said virtual image templates appropriately weighted for camera sensor data error, ppp) creating a model relating a reference field of video to the current field of video using a weighted least mean square fit for all located and virtual image templates, qqq) purging image templates from memory that do not meet certain template retention criteria, and rrr) obtaining new image templates from said current image to replace the image templates that were purged.
 28. A method of merging a primary video stream into a secondary video stream so that the combined video stream appears to have a common origin from video field to video field even as the primary video stream is modulated by camera oscillation and changes in camera orientation and settings, said apparent common origin achieved by using pattern recognition analysis of the primary video stream to stabilize and refine camera sensor data representing the acceleration, orientation and settings of the primary video stream source camera, said method comprising the steps of: sss) acquiring camera sensor data from at least one camera outfitted with hardware sensors which measure the acceleration, orientation and settings of the camera, ttt) converting the camera sensor data to a format suitable for transmission, uuu) transmitting the converted camera sensor data to a live video insertion system, vvv) converting the camera sensor data to affine form, www) performing correlations to detect landmark positions centered about landmark positions predicted by the camera sensor data, xxx) creating virtual landmarks using said camera sensor data, said virtual landmarks appropriately weighted for camera sensor data error, and yyy) creating a model relating a reference field of video to the current field of video using a weighted least mean square fit for all located and virtual landmarks.
 29. A method of merging a primary video stream into a secondary video stream so that the combined video stream appears to have a common origin from video field to video field even as the primary video stream is modulated by changes in camera orientation and settings, said apparent common origin achieved by using pattern recognition analysis of the primary video stream to stabilize and refine camera sensor data representing the orientation and settings of the primary video stream source camera, said method comprising the steps of: zzz) acquiring camera sensor data from at least one camera outfitted with hardware sensors which measure the orientation and settings of the camera, aaaa) converting the camera sensor data to a format suitable for transmission, bbbb) transmitting the converted camera sensor data to a live video insertion system, cccc) converting the camera sensor data to a form and a coordinate system useable by the live video insertion system, dddd) predicting where landmarks will be in the current field of video based on said camera sensor data, eeee) creating a model relating a reference field of video to the current field of video using a weighted least mean squares fit for all located landmarks, ffff) obtaining a set of image templates from a current video image that meet certain template capturing criteria and storing said image templates in memory, gggg) in subsequent fields of video using the predicted positions of said image templates as a starting point to determine the current position of each stored image template, hhhh) in subsequent fields of video calculating a transform model using the determined template positions to correspond reference position data to image position data in those subsequent fields, iiii) purging image templates from memory that do not meet certain template retention criteria, and jjjj) obtaining new image templates from said current image to replace the image templates that were purged.